Boxplots#
This page contains instructions and documentation for creating plots used to visualize curve ensembles.
spaghetti_plot#
Plots a random selection of curves.
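The selection step can be sketched in plain Python; this assumes hypothetical long-format rows of (run_id, date, value) and is an illustration of the idea, not the function's actual BigQuery query:

```python
import random

# Hypothetical long-format ensemble: one row per (run_id, date, value)
rows = [(run, day, float(run * 10 + day))
        for run in range(200) for day in range(5)]

n = 25  # number of curves to plot (mirrors the n parameter)
all_runs = sorted({run for run, _, _ in rows})
chosen = set(random.sample(all_runs, n))        # random selection of run_ids
sampled = [r for r in rows if r[0] in chosen]   # rows feeding one trace per run
```

Each retained run_id then becomes one line ("spaghetti strand") in the Plotly figure.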
Parameters#
client (bigquery.Client): BigQuery client object.
table_name (str): BigQuery table name containing data in ‘dataset.table’ form.
reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.
geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.
geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.
geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.
reference_column (str, optional): Name of the column in the reference table corresponding to geo_column in the original table. Defaults to ‘basin_id’.
value (str, optional): Name of the column in the original table containing the value to be analyzed. Defaults to ‘value’.
n (int, optional): Number of curves to plot. Defaults to 25.
Returns#
fig (plotly.graph_objects.Figure): Plotly Figure containing visualization.
Example#
import epidemic_intelligence as ei
from google.oauth2 import service_account
from google.cloud import bigquery
credentials = service_account.Credentials.from_service_account_file('../../../credentials.json') # use the path to your credentials
project = 'net-data-viz-handbook' # use your project name
# Initialize a GC client
client = bigquery.Client(credentials=credentials, project=project)
table_name = 'h1n1_R2.basins_prevalence_agg'
reference_table = 'reference.gleam-geo-map'
reference_column = 'basin_id' # name of a column in reference table
geo_column = 'basin_id' # name of a column in table corresponding to column in reference table
geo_level = 'basin_label'
geo_values = 'Portland(US-ME)'
value = 'Infectious_18_23'
sp_fig = ei.spaghetti_plot(
client=client,
table_name=table_name,
reference_table=reference_table,
geo_level=geo_level,
geo_values=geo_values,
geo_column=geo_column,
reference_column=reference_column,
value=value,
n=100)
# finishing touches
sp_fig.update_layout(width=900, height=500,
showlegend=True,
font_family='PT Sans Narrow',
title='Spaghetti Plot',)
sp_fig.show()
functional_boxplot#
A functional boxplot uses curve-based statistics that treat each entire curve as a single data point, rather than ranking the individual observations within a curve. The median curve and interquartile range are always plotted.
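A minimal numpy sketch of the idea, using a synthetic ensemble in place of BigQuery data: score each whole curve by its summed mean squared error against every other curve (the 'mse' centrality notion), take the most central curve as the median, and envelope the central half of curves as the interquartile-range band.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic ensemble: 50 simulation runs, 30 time steps each
curves = np.cumsum(rng.normal(1.0, 0.3, size=(50, 30)), axis=1)

# Treat each whole curve as one observation: score every curve by its
# summed MSE against all other curves (lower score = more central)
diffs = curves[:, None, :] - curves[None, :, :]   # (50, 50, 30)
score = (diffs ** 2).mean(axis=2).sum(axis=1)     # one score per curve

order = np.argsort(score)                         # most central first
median_curve = curves[order[0]]                   # deepest curve
central_half = curves[order[: len(curves) // 2]]  # inner 50% of curves
band_lo = central_half.min(axis=0)                # envelope of the
band_hi = central_half.max(axis=0)                # "interquartile range"
```

The band is the pointwise envelope of the central half of curves, so it always contains the median curve.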
Parameters#
client (bigquery.Client): BigQuery client object.
table_name (str): BigQuery table name containing data in ‘dataset.table’ form.
reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.
geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.
geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.
geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.
reference_column (str, optional): Name of the column in the reference table corresponding to geo_column in the original table. Defaults to ‘basin_id’.
value (str, optional): Name of the column in the original table containing the value to be analyzed. Defaults to ‘value’.
num_clusters (int, optional): Number of clusters that curves will be broken into based on grouping_method. Defaults to 1. Note: raising num_clusters above one significantly increases runtime.
num_features (int, optional): Number of features the kmeans algorithm will use to group curves if num_clusters is greater than 1. Must be less than or equal to the number of run_ids in the table.
grouping_method (str, optional): Method used to group curves. Must be one of:
'mse' (default): Fixed-time pairwise mean squared error between curves.
'abc': Fixed-time pairwise area between curves, also called mean absolute error.
kmeans_table (str, optional): BigQuery table name containing clustering information in ‘dataset.table’ form. Used when kmeans has already been performed with delete_data=False. Allows function to skip costly kmeans algorithm.
centrality_method (str, optional): Method used to determine curve centrality within their group. Must be one of:
'mse' (default): Summed fixed-time mean squared error between curves.
'abc': Summed fixed-time pairwise area between curves, also called mean absolute error.
'mbd': Modified band depth. For more information, see Sun and Genton (2011).
threshold (float, optional): Number of interquartile ranges a curve must lie from the median curve to be considered an outlier. Defaults to 1.5.
dataset (str or None, optional): Name of BigQuery dataset to store intermediate tables. If None, then random hash value will be used. Defaults to None.
delete_data (bool, optional): If True, then intermediate data tables will be deleted when the function finishes. Defaults to False.
overwrite (bool, optional): If True, then will not prompt for confirmation if overwriting an existing BigQuery dataset. Defaults to False.
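The distance and depth notions accepted by grouping_method and centrality_method can be sketched with numpy. This illustrates the definitions on synthetic curves; it is not the function's BigQuery implementation.

```python
import numpy as np
from itertools import combinations

t = np.linspace(0, 1, 50)
curve_a = np.sin(2 * np.pi * t)
curve_b = np.sin(2 * np.pi * t) + 0.5

# 'mse': fixed-time pairwise mean squared error between two curves
mse = np.mean((curve_a - curve_b) ** 2)    # 0.25 for this constant offset
# 'abc': fixed-time area between curves, i.e. the mean absolute error
abc = np.mean(np.abs(curve_a - curve_b))   # 0.5 for this constant offset

def modified_band_depth(curves):
    """'mbd' (Sun and Genton, 2011), with bands spanned by pairs of curves:
    a curve's depth is the average fraction of time steps it spends inside
    the band of every pair. Higher depth = more central."""
    n = len(curves)
    depth = np.zeros(n)
    for j, k in combinations(range(n), 2):
        lo = np.minimum(curves[j], curves[k])
        hi = np.maximum(curves[j], curves[k])
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
ensemble = np.cumsum(rng.normal(1.0, 0.5, size=(20, 50)), axis=1)
depths = modified_band_depth(ensemble)
most_central = int(np.argmax(depths))      # index of the deepest curve
```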
Returns#
fig (plotly.graph_objects.Figure): Plotly Figure containing visualization.
Example#
# required
table_name = 'h1n1_R2.basins_prevalence_agg'
reference_table = 'reference.gleam-geo-map'
reference_column = 'basin_id' # name of a column in reference table
geo_column = 'basin_id' # name of a column in table corresponding to column in reference table
geo_level = 'basin_label'
geo_values = 'Portland(US-ME)'
value = 'Infectious_18_23'
# Set parameters for grouping
num_clusters = 1
num_features = 20
grouping_method = 'mse' # mean squared error
centrality_method = 'mse' # mean squared error
dataset = None
delete_data = True
fbp_fig = ei.functional_boxplot(
client=client,
table_name=table_name,
reference_table=reference_table,
geo_level=geo_level,
geo_values=geo_values,
geo_column=geo_column,
reference_column=reference_column,
value=value,
num_clusters=num_clusters,
num_features=num_features,
grouping_method=grouping_method,
centrality_method=centrality_method,
dataset=dataset,
delete_data=delete_data,
overwrite=True
)
# finishing touches
fbp_fig.update_layout(width=900, height=500,
showlegend=True,
font_family='PT Sans Narrow',
title='Functional Boxplot',
yaxis_title="Infectious 18-23yo"
)
fbp_fig.show()
fixed_time_boxplot#
A fixed-time boxplot uses fixed-time statistics that rank the ensemble values at each time step independently, and uses those ranks to construct confidence intervals for each time step. The median and interquartile range are always plotted.
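A minimal numpy sketch of fixed-time statistics, using a synthetic ensemble in place of BigQuery data: quantiles are computed independently at every time step, and points outside the chosen envelope are flagged as outliers.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic ensemble: 200 runs, 40 time steps
curves = np.cumsum(rng.normal(1.0, 0.4, size=(200, 40)), axis=1)

# Fixed-time statistics: rank values independently at each time step
median = np.percentile(curves, 50, axis=0)
q25, q75 = np.percentile(curves, [25, 75], axis=0)        # interquartile range
lo95, hi95 = np.percentile(curves, [2.5, 97.5], axis=0)   # 95% envelope

# Observations outside the envelope at a given time step are outliers
outlier_mask = (curves < lo95) | (curves > hi95)
```

Unlike the functional boxplot, a single run can be inside the envelope at one time step and an outlier at another.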
Parameters#
client (bigquery.Client): BigQuery client object.
table_name (str): BigQuery table name containing data in ‘dataset.table’ form.
reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.
geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.
geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.
geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.
reference_column (str, optional): Name of the column in the reference table corresponding to geo_column in the original table. Defaults to ‘basin_id’.
value (str, optional): Name of the column in the original table containing the value to be analyzed. Defaults to ‘value’.
num_clusters (int, optional): Number of clusters that curves will be broken into based on grouping_method. Defaults to 1. Note: raising num_clusters above one significantly increases runtime.
num_features (int, optional): Number of features the kmeans algorithm will use to group curves if num_clusters is greater than 1. Must be less than or equal to the number of run_ids in the table.
grouping_method (str, optional): Method used to group curves. Must be one of:
'mse' (default): Fixed-time pairwise mean squared error between curves.
'abc': Fixed-time pairwise area between curves, also called mean absolute error.
kmeans_table (str, optional): BigQuery table name containing clustering information in ‘dataset.table’ form. Used when kmeans has already been performed with delete_data=False. Allows function to skip costly kmeans algorithm.
dataset (str or None, optional): Name of BigQuery dataset to store intermediate tables. If None, then random hash value will be used. Defaults to None.
delete_data (bool, optional): If True, then intermediate data tables will be deleted when the function finishes. Defaults to False.
overwrite (bool, optional): If True, then will not prompt for confirmation if overwriting an existing BigQuery dataset. Defaults to False.
confidence (float, optional): From 0 to 1. Confidence level of interval that will be graphed. Also determines which points are considered outliers.
full_range (bool, optional): If True, then mesh will be drawn around entire envelope, including outliers. Defaults to False.
outlying_points (bool, optional): If True, then outlying points will be graphed. Defaults to True.
Returns#
fig (plotly.graph_objects.Figure): Plotly Figure containing visualization.
Example#
# required
table_name = 'h1n1_R2.basins_prevalence_agg'
reference_table = 'reference.gleam-geo-map'
reference_column = 'basin_id' # name of a column in reference table
geo_column = 'basin_id' # name of a column in table corresponding to column in reference table
geo_level = 'basin_label'
geo_values = 'Portland(US-ME)'
value = 'Infectious_18_23'
# Set parameters for grouping
num_clusters = 1
num_features = 20
grouping_method = 'mse' # mean squared error
confidence = .95
dataset = None
delete_data = True
ft_fig = ei.fixed_time_boxplot(
client,
table_name,
reference_table,
geo_level,
geo_values,
geo_column=geo_column,
reference_column=reference_column,
num_clusters=num_clusters,
num_features=num_features,
grouping_method=grouping_method,
value=value,
dataset=dataset,
delete_data=delete_data,
kmeans_table=False,
confidence=confidence,
full_range=True,
outlying_points=False,
)
# finishing touches
ft_fig.update_layout(width=900, height=500,
showlegend=True,
font_family='PT Sans Narrow',
title='Traditional Boxplot',)
ft_fig.show()
fetch_fixed_time_quantiles#
Allows calculation of custom fixed-time quantiles. Always fetches the median.
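The mapping from a confidence level to the pair of quantiles it implies can be sketched as follows. The helper below is hypothetical, for illustration only, not part of the library:

```python
def confidence_to_quantiles(confidences):
    """Map each confidence level c (0 to 1) to the pair of quantiles
    bounding a central interval of probability mass c:
    ((1 - c) / 2, (1 + c) / 2)."""
    return [((1 - c) / 2, (1 + c) / 2) for c in confidences]

# e.g. confidences=[.9, .5] implies the 5th/95th and 25th/75th percentiles
pairs = confidence_to_quantiles([0.9, 0.5])
```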
Parameters#
client (bigquery.Client): BigQuery client object.
table_name (str): BigQuery table name containing data in ‘dataset.table’ form.
reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.
confidences (list of float): List of confidences to gather, from 0 to 1. For example, entering .5 will result in the 25th and 75th percentiles being calculated.
geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.
geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.
geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.
reference_column (str, optional): Name of the column in the reference table corresponding to geo_column in the original table. Defaults to ‘basin_id’.
value (str, optional): Name of the column in the original table containing the value to be analyzed. Defaults to ‘value’.
num_clusters (int, optional): Number of clusters that curves will be broken into based on grouping_method. Defaults to 1. Note: raising num_clusters above one significantly increases runtime.
num_features (int, optional): Number of features the kmeans algorithm will use to group curves if num_clusters is greater than 1. Must be less than or equal to the number of run_ids in the table.
grouping_method (str, optional): Method used to group curves. Must be one of:
'mse' (default): Fixed-time pairwise mean squared error between curves.
'abc': Fixed-time pairwise area between curves, also called mean absolute error.
kmeans_table (str, optional): BigQuery table name containing clustering information in ‘dataset.table’ form. Used when kmeans has already been performed with delete_data=False. Allows function to skip costly kmeans algorithm.
dataset (str or None, optional): Name of BigQuery dataset to store intermediate tables. If None, then random hash value will be used. Defaults to None.
delete_data (bool, optional): If True, then intermediate data tables will be deleted when the function finishes. Defaults to False.
overwrite (bool, optional): If True, then will not prompt for confirmation if overwriting an existing BigQuery dataset. Defaults to False.
Returns#
df (pandas.DataFrame): pandas dataframe containing quantiles and median.
Example#
# uses the same parameters as fixed_time_boxplot!
df_ft = ei.boxplots.fetch_fixed_time_quantiles(
client=client,
table_name=table_name,
reference_table=reference_table,
confidences=[.9, .5], # just introduce the confidences parameter
geo_level=geo_level,
geo_values=geo_values,
geo_column=geo_column,
reference_column=reference_column,
num_clusters=num_clusters,
num_features=num_features,
grouping_method=grouping_method,
value=value,
dataset=dataset,
delete_data=delete_data,
kmeans_table=False,
)
df_ft